Return-Path: <icon-group-sender>
Received: from kingfisher.CS.Arizona.EDU (kingfisher.CS.Arizona.EDU [192.12.69.239])
by baskerville.CS.Arizona.EDU (8.9.1a/8.9.1) with SMTP id IAA06355
for <icon-group-addresses@baskerville.CS.Arizona.EDU>; Mon, 14 Sep 1998 08:24:35 -0700 (MST)
Received: by kingfisher.CS.Arizona.EDU (5.65v4.0/1.1.8.2/08Nov94-0446PM)
id AA01514; Mon, 14 Sep 1998 08:24:08 -0700
From: gep2@computek.net
Date: Sat, 12 Sep 1998 14:19:41 -0500 (CDT)
Message-Id: <199809121919.OAA18685@mail.cmpu.net>
Mime-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Subject: Re: Unicode support or support for non-Ascii based character manipulation?
To: icon-group@optima.CS.Arizona.EDU
X-Mailer: SPRY Mail Version: 04.00.06.17
Errors-To: icon-group-errors@optima.CS.Arizona.EDU
Status: RO
>> Okay, I don't dispute that this move is happening but personally I still
>> don't very much like it. The fact is that (at least here in the Western
>> Hemisphere, where probably most of the world's computers are used) an eight-bit
>> byte is already quite sufficient for most purposes, and doubling it comes at a
>> cost in complexity and storage (RAM, disk, tape, whatever) which is simply very,
>> very hard to justify on any genuine economic basis.
> This is a fictitious problem.
Which? Most of the points there are not subject to dispute, at least for most
of us here in the USA.
a) That I don't very much like it?
b) That most of the world's computers are used in the Western Hemisphere?
c) That an eight-bit byte is quite sufficient HERE for most (I didn't say
ALL) purposes?
d) That doubling it to a sixteen-bit byte comes at a cost (I didn't say a
HUGE cost, but it IS a cost) in complexity and storage?
e) That such a cost is hard to justify (again, for MOST purposes, in
particular for business and most typical home use) given the limited or only
specialized need for a bunch of exotic characters that probably 95% of the
Western world's PC users are likely to never use?
> UNIX systems at least...
...which represent something like 4% of machines sold, and it looks like NT 5.0
will continue to erode corporate use of Unix...
> ...support UTF-8, which is a compression method
> described in ISO 10646 and the Unicode book that has the property
> that ASCII characters *still* occupy exactly one byte each.
Okay, but this still results in more complex file formats and the need for
suitable compression and decompression routines, and/or the use of mixed-mode
processing in handling strings and/or doubling storage requirements for such
strings while they are in memory (and thus obsoleting a lot of existing tools,
library routines, and other programming). We've already talked about some of
the issues regarding Icon implementation, and those are probably not
insurmountable (indeed, I think that a fully Unicode-supporting Icon
implementation... NOT to replace the normal one!!... might be a very popular
tool among those people who for whatever reason decide to use Unicode).
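(As an aside, the byte-length property being discussed is easy to check. A minimal Python sketch, purely illustrative and not part of the original exchange:)

```python
# UTF-8 keeps ASCII at exactly one byte; other characters take more.
for ch in ("A", "\u00e9", "\u4e2d"):  # ASCII, accented Latin, a CJK character
    print(repr(ch), len(ch.encode("utf-8")), "byte(s)")
```

So plain-ASCII files and strings are unchanged under UTF-8; only the non-ASCII characters pay the multi-byte price.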
> When I use getwc() on this system, it decodes UTF-8 files and gives me
> ISO 10646 wide characters internally.
Which means, I presume, that those characters internally take twice the storage
they would otherwise. That comes at a cost in storage, and with the disadvantage
that (barring some kind of new machine architecture, at least, where there is a
NATIVE 16-bit byte, I suppose, without direct addressability of increments
smaller than that) programming must change to account for the fact that all
bytes are now byte PAIRS and that alignment issues suddenly become of prime
importance.
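(The doubling claim can be illustrated concretely. A minimal Python sketch; UTF-16 little-endian is used here purely as a stand-in for a fixed 16-bit wide-character representation:)

```python
text = "plain ASCII text"
utf8 = text.encode("utf-8")       # one byte per ASCII character
wide = text.encode("utf-16-le")   # two bytes per character for this text
print(len(utf8), len(wide))       # the wide form is exactly double
```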
>> If other countries have more difficult (or huge) character sets,
>> that is (while a fact of life) simply an inherent disadvantage
>> of their culture (and note that I'm not intending that as a slam
>> or value judgement, it just IS the way it is), and I don't see a
>> terribly convincing argument why the other countries (without
>> that disadvantage) ought to pay the price too, just in order to
>> artificially level the playing field.
> Many people _within_ Western Europe do not have the luxury of dealing
> with only a single language.
Sure, but I'll point out that the great majority of them (and here I'm talking
about typical business and home users, I'm not talking about academic types who
ABSOLUTELY have to have a whole assortment of Armenian, Sanskrit and other
highly specialized fonts for their scholarly work) do rather okay with the
systems they're presently using.
> I cannot write my father's name in ASCII, nor my sister-in-law's. Both of
> them are (in my father's case, were) monoglot Anglophones born into monoglot
> Anglophone families in an English-speaking country. I _can_ write their names
> in ISO Latin-1, but I _can't_ write half of the place-names of this country!
I note that you don't mention WHICH country you're talking about.
Of course, I suppose I could buy an island somewhere and name it some new name
using some bizarre alphabet, and then ask everyone in the world to adjust all
their systems to support my new alphabet!
When most immigrants came to the USA during the latter half of the previous
century (and the first twenty or so years of this one) a LOT of them changed the
spelling and writing of their names. Hey, I can't address a letter to
Peking/Beijing/whatever from my computer these days using the *REAL* name of the
city, spelling the name the way the local residents do, either. Even among
Western countries, a Parisian sending a letter to London will usually address it
as "Londres", and most Americans writing to a friend in Cologne, Germany will
address it that way rather than "Koln" (yeah, I know that they put the
double-dot over the "o" too). But you know something? All of those letters
WILL be delivered just fine to the recipients in Beijing, London, or Cologne,
because we NORMALLY deal (and generally reasonably well) with these differences
of the way that different world peoples call each other's countries. Not just
when the names are different, but also when the alphabets are different. I'm
sure I could write a letter to someone in an Arab country using a Western,
non-Arab alphabet and still get it delivered. Despite the fact that locally
written letters are doubtless addressed in Arabic. The post office there can
handle BOTH (and better, I'm sure, than the US post office could deal with a
letter addressed to someone HERE in Arabic!).
> (The officially approved orthography for Maori puts a macron over
> long vowels, like the 'a' in Maori. There are no macrons in Latin-1.)
> Even if my text switched between Latin-1 family members, I _still_
> wouldn't be able to write English, because the inverted comma and
> double inverted comma quotation marks are not available, let
> alone en dashes and em dashes.
Frankly, I think the double quote and apostrophe work just fine for most people.
So to say that you "can't write English" is fairly ridiculous. In fact, what
will probably happen is that these archaic inconveniences will simply fade away,
due precisely to the fact that they aren't widely supported and most people
simply couldn't care less.
> The *only* character set around in which this functionally-monoglot
> Anglophone can write *in English* about the people and places around
> him is ISO 10646; even Latin-1 just isn't good enough FOR ENGLISH!
Frankly, I think that the great majority of your audience will probably do just
fine with a "close approximation". My neighbor and wonderful friend in Paris
was Russian (in fact, he's on this list... HI Vlad!) but he didn't seem to be
terribly upset that he couldn't write his name there spelled using the Cyrillic
characters he'd grown up with. What's important for most people is that they
communicate successfully with the people that are important to them, and most of
the time we do that pretty well.
Frankly, if you told most Americans that they weren't writing proper English
because they didn't use inverted commas and double inverted comma quotation
marks, or properly use en dashes and em dashes, I suspect that they'd look at
you with disbelief as if you were from Mars or something, and tell you to get a
life.
> I also note that Icon (like SNOBOL before it) has been of particular
> interest to scholars in the humanities, who would, for example, like
> to put Hebrew _and_ Arabic in the same document with English, which
> is something you can't do in any ISO 8859 family member, not without
> code switching, which is much harder to deal with than Unicode.
Obviously scholars who worry about such issues have a variety of specialized
word processors and other such software to deal with their multi-lingual,
multi-alphabet requirements (and that's as it should be, probably). Again, as
I've mentioned in other posts, there are a whole series of issues that go way
beyond simply having enough characters in the character set for "everyone's"
characters to be there in direct, native mode. Some languages write
right-to-left in horizontal rows (Hebrew for example), and some languages write
top to bottom and then to the left in vertical rows (Japanese for instance).
Trying to mix these styles in the same document and on the same line is complex
at minimum and very frustrating for typical users (when using such word
processors, the simple use of the left and right arrow keys to move the cursor
certainly doesn't obey the "principle of least astonishment" as it's known to
most of us!).
> There is the pretty obvious point that within Europe, they are going
> to *have* to use the new "Euro" sign. (Why have the Europeans
> named their new currency after an Australian mammal?) That's U+20AC,
> and if there's an 8-bit character set that has it, please tell us which.
You're being ridiculous, since OBVIOUSLY they have created a NEW character
EXPRESSLY for the purpose of it being new. Clearly it's not part of *any*
previously-existing character set. (For that matter, it wasn't part of Unicode
EITHER before they created it and got it added).
Even once the character is added officially to the CHARACTER SET, even that
doesn't really begin to solve the problem. Because now you have to address the
issue of how you're going to ENTER it (keyboard?), and how you're going to
DISPLAY it. There are (at least!) tens of thousands of fonts out there, and
*none* of them will have these newly-created characters in them. I'd hate to
even think of a TrueType font for "all" of Unicode's characters. Let alone a
full set of fonts for all the different type styles and variants. These fonts
(for those of us that tend to collect a lot of them) take up too much space on
hard disks as it is.
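(For what it's worth, the encoding side of the Euro question is the easy part. A minimal Python sketch, illustrative only, showing that U+20AC encodes cleanly in UTF-8 but has no slot in Latin-1 -- entry and display remain separate problems, as noted above:)

```python
euro = "\u20ac"
print(euro.encode("utf-8"))   # three bytes in UTF-8: b'\xe2\x82\xac'
try:
    euro.encode("latin-1")    # Latin-1 has no Euro sign
except UnicodeEncodeError:
    print("not representable in Latin-1")
```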
>> I can certainly understand and appreciate the problems that the huge
>> character sets used in some eastern countries have played for them
> Never mind eastern countries. What about an American businessman writing
> to an office in Germany about their operations in Russia?
Straw man. These communications take place just fine today, without using
Cyrillic.
> What about a
> theologian writing in English but quoting Hebrew and Greek frequently?
That's of academic interest but (HIGHLY specialized) academic needs should NOT
force businesses and typical home users to pay more to support the needs of a
VERY small percentage (at least until you get REAL far away) of other users.
> What about an English professor writing a book in modern English about
> Old English (we've lost four letters, which can be found in Unicode
> but not any 8-bit character set I know of. Ash _is_ in Latin1, but
> eth, thorn, yogh, and wynn are not.)
Again, most of us couldn't care less. He (or she) is welcome to deal with that
issue however they like. The current system has NOT precluded such scholarly
research up to now, so I don't see why this is such a big issue all of a sudden.
> By the way, 16 bits isn't enough; there are proposals already far advanced
> in the pipeline for characters to go into Plane 1.
And that starts to get even more ridiculous. As I said, it's a slippery slope
when you decide that everyone has to be able to support EVERYBODY else's needs,
even when for most people they are TOTALLY IRRELEVANT. I would imagine that
someone has even assigned "official" Unicode character assignments to Klingon
characters! So are OTHER people going to start dreaming up their own weird
alphabets and asking the rest of the world to jump through hoops supporting
those, too?
Frankly, I'm never going to need to read (OR WRITE!) Armenian. I'm even
unlikely to read or write most Asian languages, or Hebrew, or numerous others
which are important to many people SOMEWHERE on the globe. And frankly, I think
most of my consulting clients' needs are served just fine by "normal" ASCII. It
is ludicrous to expect them to put up with extra cost and complexity in their
business to support something that they don't need, don't want, and in fact
would have *no* use for whatsoever.
People who DO have special requirements (and I'm not disputing that there ARE
such persons) should, for their part, EXPECT to deal with the extra costs and
the additional hassles that their special needs demand.
Gordon Peterson
http://www.computek.net/public/gep2/
Support the Anti-SPAM Amendment! Join at http://www.cauce.org/